The paaPack package performs hierarchical Principal Amalgamation Analysis (HPAA), with or without the guidance of a taxonomic tree structure, and provides several graphical tools for visualizing the results: 1) hierarchical dendrograms that show the full path of amalgamations, 2) a scree plot that shows the percentage change in diversity loss as the number of compositions changes, and 3) an ordination plot that shows how the between-sample distance patterns change before and after HPAA for any given number of principal compositions (PCs). In this tutorial, we also provide an R Shiny app to dynamically visualize how the ordination plot changes along the HPAA path, i.e., from the largest to the smallest number of PCs.
install.packages("../Rpackage/paaPack_0.0-1.tar.gz", repos = NULL, type="source")
library(paaPack)
#### functions in paaPack
help(hPAA) #### fit HPAA models
help(plotHPAA) #### dendrogram showing the hierarchical amalgamation
help(plotLine) #### scree plot showing the percentage change in diversity loss
help(plotMDS) #### ordination plot
The main function of paaPack, hPAA(), performs hierarchical Principal Amalgamation Analysis (HPAA).
To use the function, the analyst should provide:
1) The compositional data, with each row holding the composition of one subject and each column corresponding to one of the original components/taxa.
2) The taxonomic tree structure, as a vector whose length equals the number of taxa. The format of the taxonomic
vector is the same as the output format of the commonly used bioinformatics data processing software Mothur.
That is, each element of the vector gives the full taxonomic ranks of a taxon, from kingdom down to the genus or species level,
with ranks separated by semicolons. For example, a typical element of the taxonomic vector could be
k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_longum.
The taxonomic structure is optional. If no taxonomy is provided, unconstrained HPAA without tree guidance is performed.
3) The diversity measure to be used in the HPAA analysis, and whether a strong or weak taxonomic hierarchy is applied.
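To make the two data inputs concrete, the following base R sketch builds a toy composition matrix and a taxonomy vector in the semicolon-separated format described above. Apart from the first taxonomy string, which is the example from the text, all values and object names (`taxonomy`, `ranks`, `counts`, `comp`) are illustrative assumptions and are not part of the NICU data or of paaPack.

```r
# Toy taxonomy vector: one semicolon-separated string per taxon.
# The first entry is the example from the text; the others are illustrative.
taxonomy <- c(
  "k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_longum",
  "k_Bacteria;p_Actinobacteria;c_Actinobacteria;o_Bifidobacteriales;f_Bifidobacteriaceae;g_Bifidobacterium;s_breve",
  "k_Bacteria;p_Firmicutes;c_Bacilli;o_Lactobacillales;f_Streptococcaceae;g_Streptococcus;s_salivarius"
)

# Split each string into its ranks: one row per taxon, one column per rank.
rank_names <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
ranks <- do.call(rbind, strsplit(taxonomy, ";", fixed = TRUE))
colnames(ranks) <- rank_names

# Toy count table: rows are subjects, columns are taxa (same order as taxonomy).
counts <- matrix(c(10,  5, 85,
                   40, 40, 20),
                 nrow = 2, byrow = TRUE)

# Convert counts to compositions so that each row sums to one.
comp <- counts / rowSums(counts)
```

Inspecting `ranks[, "order"]` shows which taxa share an order, which is the kind of grouping that the tree-guided analysis relies on.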
A set of plotting methods then takes the object of class “hPAA” returned by hPAA() as input: plotHPAA()
for the dendrogram showing the full path of hierarchical amalgamation, plotLine() for the scree plot showing the percentage change in
diversity loss as the number of principal compositions changes, and plotMDS() for the ordination plot showing the changes
in the between-sample distance patterns before and after HPAA. Each function accepts a group of graphical arguments for customizing the figure.
For details, see the documentation of each function via help().
In the following sections, we use the NICU data to illustrate the visualization tools provided by paaPack. The code for constructing all the figures is documented in the corresponding code chunks of the source R markdown file. These tools are useful for visualizing and understanding compositional data, and for helping to determine the desired number of principal compositions in practice.
We construct an HPAA dendrogram to simultaneously visualize both the tree diagram of the successive amalgamations and the taxonomic
structure of the taxa using the function plotHPAA(). To illustrate, Figure 2.1 shows the HPAA dendrogram from
performing HPAA with SDI loss and strong taxonomic hierarchy on the NICU data. The top part of the figure shows the dendrogram of amalgamations,
where the \(y\)-axis shows the percentage decrease in total diversity as measured by SDI (on the log scale) along the successive amalgamations,
from the bottom to the top. As such, any horizontal cut of the dendrogram at a desired level of diversity loss/preservation shows the
corresponding amalgamated data. In particular, each red dashed horizontal line indicates the step at which the original data are aggregated
to a higher taxonomic rank. It shows that, for example, aggregating the data to the order level (22 taxa or principal compositions left) through
HPAA leads to a 22.3% loss in total SDI. In the bottom part, we use color bars to show the taxonomic structure of the taxa, where in each horizontal
bar taxa of the same color belong to the same category of that rank.
Figure 2.1: The NICU data: Dendrogram of HPAA with SDI and strong taxonomic hierarchy.
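The quantity on the \(y\)-axis can be illustrated with a small base R sketch. It assumes that SDI refers to Simpson's diversity index, \(1 - \sum_j x_j^2\) for a composition \(x\), and that total diversity is the sum of this index over samples; the exact definition used inside paaPack is not stated here, so this is a hedged illustration, not the package's implementation. The helper names `simpson()` and `amalgamate()`, the group labels, and the toy data are hypothetical.

```r
# Simpson's diversity index of each row of a composition matrix
# (assumption: "SDI" in the text refers to this index).
simpson <- function(comp) 1 - rowSums(comp^2)

# Amalgamate columns that share a group label by summing them within each sample.
amalgamate <- function(comp, groups) {
  t(rowsum(t(comp), group = groups))
}

# Toy compositions: 2 samples by 3 taxa (illustrative values only).
comp <- rbind(c(0.5, 0.3, 0.2),
              c(0.2, 0.2, 0.6))

# Hypothetical order-level labels: taxa 1 and 2 share an order.
order_labels <- c("o_A", "o_A", "o_B")

# Aggregate to the order level and compare total diversity before and after.
comp_order <- amalgamate(comp, order_labels)
total_before <- sum(simpson(comp))
total_after  <- sum(simpson(comp_order))
pct_loss <- 100 * (total_before - total_after) / total_before
```

Because \((a + b)^2 \ge a^2 + b^2\) for non-negative proportions, amalgamating taxa can never increase this index, so `pct_loss` is always non-negative.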
Next, for each loss function, we display the results under the different types of taxonomic guidance in one combined figure for easy comparison. Figures 2.2 and 2.3 show HPAA dendrograms with SDI loss and BC loss, respectively, under all three levels of taxonomic guidance. Not surprisingly, the patterns of amalgamations vary across settings. Without a taxonomic constraint, the change in diversity is very smooth along the amalgamations, but the resulting principal compositions may not be easily interpretable, as indicated by the mixed color patterns in the color bars of the taxonomic ranks. With strong taxonomic hierarchy, on the other hand, the principal compositions are forced to closely follow the taxonomic structure, but the percentage change in diversity tends to exhibit dramatic jumps, especially at the steps where the last remaining taxon at a lower taxonomic rank is forced to be aggregated to a higher rank. Weak taxonomic hierarchy is a compromise: the resulting principal compositions remain interpretable, and the percentage change in diversity remains smooth and can be quite close to that of the unconstrained setting in the early stages of amalgamation.
Figure 2.2: The NICU data: HPAA dendrograms with SDI and different constraints on taxonomic hierarchy.
Figure 2.3: The NICU data: HPAA dendrograms with Bray-Curtis and different constraints on taxonomic hierarchy.
Next, we use the function plotLine() to construct scree plots for the results of HPAA under different types of taxonomic guidance.
The scree plot shows the percentage change in diversity loss as a function of the number of principal compositions.
Figure 2.4 shows the scree plots from performing HPAA on the NICU data under different settings. The differences
among the three levels of taxonomic guidance are clearly visible and confirm the earlier observation from the dendrograms that
weak taxonomic hierarchy strikes a good balance between preserving information and interpretability.
Figure 2.4: The NICU data: Scree plots for HPAA (Percentage change in diversity vs. number of principal compositions).
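The numbers behind a scree plot of this kind can be sketched in base R for the unconstrained case. The sketch below is a simplified greedy procedure written for illustration, not the algorithm implemented in paaPack: at each step it merges the pair of columns whose amalgamation keeps the total Simpson diversity highest (assuming, as a labeled assumption, that SDI denotes Simpson's index). The names `simpson_total()`, `greedy_path()`, and `plot_scree()` and the simulated data are hypothetical.

```r
# Total Simpson diversity over all samples (assumed meaning of "SDI").
simpson_total <- function(comp) sum(1 - rowSums(comp^2))

# Greedy unconstrained amalgamation: repeatedly merge the pair of columns
# that loses the least total diversity, and record the loss at each step.
greedy_path <- function(comp) {
  total0 <- simpson_total(comp)
  path <- data.frame(n_comp = ncol(comp), pct_loss = 0)
  while (ncol(comp) > 1) {
    pairs <- combn(ncol(comp), 2)
    # Total diversity that would remain after merging each candidate pair.
    totals <- apply(pairs, 2, function(ij) {
      merged <- cbind(comp[, -ij, drop = FALSE],
                      rowSums(comp[, ij, drop = FALSE]))
      simpson_total(merged)
    })
    best <- pairs[, which.max(totals)]
    comp <- cbind(comp[, -best, drop = FALSE],
                  rowSums(comp[, best, drop = FALSE]))
    path <- rbind(path, data.frame(
      n_comp = ncol(comp),
      pct_loss = 100 * (total0 - simpson_total(comp)) / total0
    ))
  }
  path
}

# Scree plot: percentage loss against the number of compositions kept.
plot_scree <- function(path) {
  plot(path$n_comp, path$pct_loss, type = "b",
       xlab = "Number of principal compositions",
       ylab = "Percentage loss in total diversity")
}

# Simulated toy data: 6 samples by 5 taxa, strictly positive counts.
set.seed(1)
counts <- matrix(rpois(6 * 5, lambda = 20) + 1, nrow = 6)
comp <- counts / rowSums(counts)
path <- greedy_path(comp)
```

Calling `plot_scree(path)` draws the curve. The loss is 0 with all taxa kept and reaches 100% once everything is merged into a single composition, so the useful information lies in how quickly the curve rises in between.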
Finally, we use plotMDS() to construct an ordination plot that visualizes the changes in the between-sample distance patterns before and
after HPAA with any given number of principal compositions. Specifically, the function performs non-metric multidimensional
scaling (NMDS) with Bray–Curtis dissimilarity on the combined original data and principal compositions from HPAA, which
produces a low-dimensional ordination plot of all samples before and after amalgamation. Each sample is represented by a pair
of points, one from the original data and one from the principal compositions; the smallest circle that covers the pair is drawn, and its radius
indicates the level of distortion due to the HPAA data reduction. The ordination plots from performing HPAA on the NICU data with three different
loss functions and weak taxonomic hierarchy are shown in Figure 2.5, in which 20 principal compositions are kept (the number of PCs can be
changed via the corresponding parameter of the plotMDS function). All three settings preserve the between-sample diversity reasonably well,
as indicated by the generally small radii of the circles; as expected, HPAA with the BC loss performs best, as it directly targets the preservation of between-sample diversity.
Figure 2.5: The NICU data: 2D NMDS ordination plots for comparing original and principal compositions from HPAA with weak taxonomic hierarchy.
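The geometry behind these plots can be sketched in base R. The example below rests on several labeled assumptions and is not how plotMDS() is implemented: classical MDS via `stats::cmdscale()` stands in for NMDS to avoid extra dependencies, the amalgamated data are embedded in the original taxon space by keeping the merged value in one column and zeroing the other, and the toy data and helper names `bray_curtis()` and `pair_circle()` are hypothetical.

```r
# Bray-Curtis dissimilarity between two non-negative vectors.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Smallest circle covering two points: centred at their midpoint,
# with radius equal to half the distance between them.
pair_circle <- function(p, q) {
  list(center = (p + q) / 2, radius = sqrt(sum((p - q)^2)) / 2)
}

# Toy compositions: 4 samples by 4 taxa (illustrative values only).
orig <- rbind(c(0.50, 0.30, 0.15, 0.05),
              c(0.10, 0.20, 0.30, 0.40),
              c(0.25, 0.25, 0.25, 0.25),
              c(0.70, 0.10, 0.10, 0.10))

# Amalgamate taxa 3 and 4. The merged value is kept in column 3 and column 4
# is set to zero so both versions live in the same space (an assumption).
amalg <- orig
amalg[, 3] <- orig[, 3] + orig[, 4]
amalg[, 4] <- 0

# Stack both versions and compute all pairwise Bray-Curtis dissimilarities.
both <- rbind(orig, amalg)
n_all <- nrow(both)
d <- matrix(0, n_all, n_all)
for (i in seq_len(n_all)) {
  for (j in seq_len(n_all)) {
    d[i, j] <- bray_curtis(both[i, ], both[j, ])
  }
}

# Classical MDS as a dependency-free stand-in for NMDS.
coords <- cmdscale(as.dist(d), k = 2)

# One circle per sample, covering its original and amalgamated points.
n <- nrow(orig)
radii <- sapply(seq_len(n), function(i) {
  pair_circle(coords[i, ], coords[n + i, ])$radius
})
```

A small value in `radii` means that the sample moved little in the ordination after amalgamation, which is the visual cue that the circles in the ordination plots convey.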